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Abstract 

The paper deals with a semiparametric generalized partially linear regression model with 
unknown regression coefficients and an unknown nonparametric function. We present a 
maximum penalized likelihood procedure to estimate the components of the partial linear 
model introducing penalty based wavelet estimators. Asymptotic rates of the estimates of 
both the parametric and the nonparametric part of the model are given and quasi-minimax 
optimality is obtained under usual conditions in literature. We establish in particular that the 
-penalty leads to an adaptive estimation. An algorithm is also proposed for implementation 
and simulations are used to illustrate the results. 
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1 Introduction 



The aim of this paper is to consider a regression model, where the response Y is to be predicted 
by covariates z, with Y real-valued and with z a real explanatory vector. Relaxing the usual 
assumption of normality, we consider a generalized framework. The response value y is drawn 
from a one-parameter exponential family of distributions, with a probabilistic density of the form: 

exp ( '""^' ' +c(y,j,)). (1) 

In this expression, b{-) and c(-) are known functions, which determine the specific form of the 
distribution. The parameter ^ is a dispersion parameter and is also supposed to be known in 
what follows. The unknown function rj{-) is the natural parameter of the exponential family, 
which carries information from the explanatory variables. Given a random sample of size n drawn 
independently from a generalized regression model, the aim is to predict the function fj{-). Such a 
model gives a large scope of applications because observations can result from many distribution 
families such as Gaussian, Poisson, Binomial, Gamma, etc. For a more thorough description 
of generalized regression modelling, we refer to McCullagh & Nelder (1989) or Fahrmeir & Tutz 
(1994). 

Let (Y;, z,),=i^ „ be an independent random sample drawn from a generalized regression model. 
The conditional mean and variance of the i"^ response Y/ are given by: 

E[Y;|Z,] = b{f]{Zi)) = ^{Zi), (2) 

Var[Yi\zi] = fHvi^i))' (3) 

where b{-) and fc(-) denote respectively the first and second derivatives of &(■). The function 
G = b^^ is called link function and one have G(E(Y;|zj)) = Tl{zi). 

A linear model consists in assuming that the dependence from the covariate z is linear, meaning 
r]{z) can be written on the form i]{z) = z^)S ; the superscript T denotes the transpose of a vector 
or matrix. Yet, in some applications, the linear model is insufficient to explain the relationship 
between the response variable and its associate predictors. The generalized functional model relax 
this linearity, considering a nonparametric form for the canonical parameter, say ?/(z) = /(z). 
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Nevertheless in such a model appears the well-known curse of dimensionality. To avoid this 
drawback, we allow most predictors to be modelled linearly, while a small number is modelled 
nonparametrically. To this aim we decompose the covariate z in two components: in the following 
z = {X,t), with X a p-dimensional vector and t a real-valued covariate. The function ?/(•) is given 
by: 

G(/Y(X,f)) =?7(X,0 =X^i8 + /(0, (4) 

where )3 is an unknown p-dimensional real parameter vector and /(■) is an unknown real- valued 
function; such a model is called a generalized partially linear model (GPLM). Given the observed 
data (y„ X„ i;)i=i,...,^, the aim is then to estimate from the data the vector )S and the function /(•). 

In a Gaussian modelisation. Rice (1986) and Speckman (1988) put in evidence that the rates of 
the estimates for linear and nonlinear parts could not be both optimized without a control of the 
correlation between the explanatory variable of the linear part and the functional part of the model. 
With such a control, Speckman (1988) proves that it is possible to obtain both optimal linear rate 
and nonparametric rate for the estimates. To my knowledge, the only paper establishing such a 
result in GPLM is Mammen & Van der Geer (1997). 

Many papers focus on the asymptotic behaviour of the estimator of the parametric part )8 in 
generalized partially linear models (see e.g. Chen (1987) by a penalized quasi-least squares or 
Severini & Staniswalis (1994) by profile likelihood methods). A recent article of Boente et al. (2006) 
establishes a uniformly convergent estimation for / using a robust profile likelihood similar to 
Severini & Staniswalis (1994). But few works consider simultaneously the parametric and the 
nonparametric part of the model. The paper of Mammen & Van der Geer (1997) shows minimax 
optimality for the estimations of both / and ^ with a penalized quasi-least squares procedure. Note 
that the authors use a Sobolev type penalty. The conditions given there for attaining optimality for 
both parametric and nonparametric estimators appear to be more restrictive than those given in 
Gaussian partially linear models in the literature. 

This paper proposes a new estimation procedure based on wavelet decompositions. Wavelet 
based estimators for the nonparametric component of a Gaussian partially linear model have been 
investigated by F. Meyer (2003), Chang & Qu (2004), Fadili & BuUmore (2005) or Gannaz (2007b) 
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and Gannaz (2007a) more recently. But it has not been studied in the content of GPLM. Estimation 
methods encountered in literature need the choice of a smoothing parameter, which optimal value 
depends on the regularity of the functional part /. A cross-validation procedure is then necessary to 
evaluate this parameter. As noted by Rice (1986) or Speckman (1988), cross-validation can present 
much instability in partially linear models. The use of wavelets here leads to a procedure where 
no cross-validation is needed. This adaptivity is the main novelty of our estimation scheme in 
such models. We moreover establish the near-minimax optimality of the estimation for both the 
linear predictor )3 and the nonparametric part /, under usual assumptions of correlation between 
the two parts. The correlation condition appears to be similar to what is classically for the Gaussian 
case and is weaker than in Mammen & Van der Geer (1997). Finally, we present an algorithm for 
computing the estimates. 

The paper is organized as follows: Section 2 presents the assumptions and the estimation 
procedure. It also gives the main properties of our estimators. We distinguish two cases: non 
adaptive penalties and a £^ type penalty, which leads to adaptivity. In Section 3, we propose a 
computational algorithm of the adaptive estimation procedure and we present a small simulation 
study for the numerical implementation. Proofs of our results are given in the Appendix. 

2 Assumptions and estimation scheme 

We consider a generalized regression model, where the response Y depends on covariates (X, t), 
where Y is real- valued, X is a p-dimensional vector and Hs a real-valued covariate. The value y is 
drawn from an exponential family of distributions, with a probabilistic density of the form: 



where functions fo(-) and c(-) as well as the dispersion parameter (p are supposed to be known. The 
aim is to evaluate the unknown generating function ?/(•). As noted above, we are interested in this 
paper by a partially linear modelisation of the function ?/ ( • ) and hence we suppose that it has the 
semiparametric expression: 
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The vector j8 and the function / are respectively the parametric and nonparametric components 
of the generalized partially linear model (GPLM). In the following, (Y^X;, i;) will denote an 
independent random sample drawn from the GPLM described here. 

2.1 Penalized maximum likelihood 

The aim of the paper is to estimate simultaneously the parameter j8 and the function /, given the 
observed data {Yi,\i, ^Oi^i,...,"- propose a penalized maximum loglikelihood estimation. Let £ 
denotes the loglikelihood function: i{y,ri) = ytj — b{r]). We consider throughout the paper that the 
estimators /„ and are solutions of : 

{f„,l„)= argmax Kn{f,^) with Kn{f, ^) = f^£ U,Xj ^ + f{ti)) - Pen{f). (5) 

{/,||/||oo<C™},i8GlRP i=l 

We refer to Antoniadis et al. (2009) for the conditions of existence of such a maximization problem. 
In what follows, we will assume the penalty is convex and the likelihood function is bounded in 
order to ensure the existence of maxima (unicity is not acquired but there is no local maxima). We 
did not succeed in getting rid of the constraint ||/||oo < Coo in the proofs but such a condition does 
not seem too restrictive in practice. 

The computation of the maximization problem is done in two step. Some studies, such as Speckman 
(1988), incite to estimate first the functional part. Thus, we will proceed as follows: 

^- fnB= argmax K„{f,^). 

{/,||/l|co<Coo} 

2. )S„ = argmax K„{fn, Actually, a classical procedure is here to maximise a modified 

criterion called profile likelihood (see among others Severini & Wong (1992) or Boente et al. 
(2006)). The criterion maximized is then ^^Li ^(y/, ^(XfjS + /^(fi))) - Pen(/). The expression 
of b{Xj^ + f^{ti)) can indeed be simplified using first order conditions of step 1. Due to the 
non-linearity of our procedure, we choose here to keep a loglikelihood approach. 

3. fn = f?) ■ 
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Note also that an usual estimation procedure used in generalized models is quasi-likelihood 
maximization. For details, we refer among others to Severini & Staniswalis (1994) or 
McCuUagh & Nelder (1989). In GPLM, quasi-likelihood estimation was developed for example 
by Chen (1987) and Mammen & Van der Geer (1997). 

2.2 Discrete wavelet transform 

The aim of the present work is to introduce a wavelet penalty in estimation. The idea is to use the 
wavelet representation of the functional part we wish to estimate through this penalty. For more 
precision on wavelets, the reader is referred to Daubechies (1992), Y. Meyer (1992) or Mallat (1999). 

Let (L^[0, 1], (■,•)) be the space of squared-integrable functions on [0,1] endowed with the inner 
product {f,g) = JjQ f{t)g{t) dt. Throughout the paper we assume that we are working within 
an r-regular (r > 0) multiresolution analysis of (L^[0, 1], (•, ■)), associated with an orthonormal 
basis generated by dilatations and translations of a compactly supported scaling fimction, (p{t), 
and a compactly supported mother wavelet, tp{t). For simplicity reasons, we will consider periodic 
wavelet bases on [0, 1]. 

For any ; > and k = 0, 1, . . . , 2^' - 1, let us define (pj^kit) = l''^(p{2H - k) and ^j,k{i) = 
2i^^xp{2h-k). Then for any given resolution level jo > the family 

{<py„;c, = 0, 1, . . . ,2^° - 1; xPj,k, i > 7o; ^ = 0, 1, . . . ,2^' - l} 

is an orthonormal basis of L^[0, 1]. Let / be a function of L^[0, 1] ; if we denote by Cy^, j. = (/, (pj^^k) 
(fc = 0, 1, . . . , 2^0 _ 1) the scaling coefficients and by dy,^ = (/, ^pj^^) {j > jo; k = 0,1, ■ ■ ■ , 2' - 1) the 
wavelet coefficients of /, the function / can then be decomposed as follows: 

2'0-l 00 2' -I 

/(O = E ^iom,kit) + E E dj,knk{t), t G [0,1]. 

k=0 j=jo k=0 

Yet in practice, we are more concerned with discrete observation samples rather than continuous. 
Consequently we are more interested by the discrete wavelet transform (DWT). Given a vector of 
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real values e = (ej, . . . , e„)^, the discrete wavelet transform of e is given hy = Y^x^e, where is 
an n X 1 vector comprising both discrete scaling coefficients, ^, and discrete wavelet coefficients. 
The matrix Y„xn is an orthogonal n x n matrix associated with the orthonormal periodic 
wavelet basis chosen, where one can distinguish the Blocs spanned respectively by the scaling 
functions and the wavelets. 



Note that if F is a vector of function values F = (/(^i), . . . ,f{tn)Y equally spaced points ^^, then 
the corresponding empirical coefficients 0? ^ and are related to their continuous counterparts 
and dy^c with a factor n^^''^. It is worthy to remark also that because of orthogonality of ^nxn, 
the inverse DWT is simply given by F = Y^xii^- If n = 2^ for some positive integer /, Mallat 
(1989) propose a fast algorithm, that requires only order n operations, to compute the DWT and the 
inverse DWT. 



2.3 Assumptions and asymptotic minimaxity 

Let 1 1 . 1 1 denotes the euclidean norm on RP and \\h\\^j = ^ ^ ( f , ) ^ f or any function h . To ameliorate 
the comprehension of the results and the assumptions, the subscript will identify in the following 
the true values of the model. 

Due to the use of the Discrete Wavelet Transform described above, we will consider in the following 
that the functional part is observed on an equidistant sample f, = ^, and that the sample size satisfy 
n = 2^ for some positive integer /. 

We first introduce assumption (Al) which ensures the identifiability of the model: 

(Al) ^X^X converges to a strictly positive matrix when n goes to infinity, and ^X^Fq goes to 
when n goes to infinity, with Fo = (/o (^i )/•••/ /o (^n ) ) • 

Define H = X(X^X)^^X^ the projection matrix on the space generated by the columns of X. The 
matrix H admits a rank and thus a trace equal to p. If hj = Xi{X^X)^^Xj denotes the z^'' diagonal 
term of H, this means that ^//j = p. We moreover suppose that: 

(A2) h = maxj=i . „ hi — > 
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This assumption is very usual (see e.g. Huber (1981)). 



Concerning the loglikelihood function, we assume that: 

(A3) sup||,^_,^^||_^<2(Cco+||/||oo)S"P! ''^'('7!) < bco < 00. We also suppose that min;fc(Xf^o+ /o(f!) ) > 

n 

bo > and max,fo(X[^o +/o(fO) < ^^oc < c«. Moreover, }-^J^b{Xj + fo{t,))XjXi 



converges to a positive matrix Eq. 



i=i 



Recall that the function b{-) is associated to the variance of the observations and that on have b > 0. 
Then some more restrictive assumptions are made on the form of the distribution: 



(A4.1) There exist a constant fl > such that max,=i „ E exp(£(y„ Xj^ + f{ti)f/a) 

(A4.2) There exist constants ^, Oq > such that 



< a, 



max E[exp £(¥,-, X/ j6 + /(i,)) 

i=l,...,n V V 

Assumption (A4.1) corresponds to exponential tails and is weaker than assumption (A4.2), which 
corresponds to sub-Gaussian tails. 

We aim to control the correlation between the linear and the nonparametric parts of the model. 
Following Rice (1986) or Speckman (1988) we decompose the components of the design matrix 
X into a sum of a deterministic function of L^[0, 1] and a noise term. More precisely, the {i,])- 
component of X, say Xy, is supposed to take the form = gi{tj) + with functions {gi)i=i,...,p 
forming an orthogonal family on (L^[0, 1], (•, •)) and with ^/y denoting a realization of a random 
variable The variables (f,);=i,...,M are supposed to be centered, independent, with finite variance. 
We can easily see that the orthogonality of the family (§';);=i,...,p ensures that the matrix ^X^X goes 
to a strictly positive matrix when n goes to infinity. If in addition the system {gi)i=\^,..,p satisfies 
/[o 1] /(OS'KO '^^ — ^' then assumption (Al) and consequently identifiability are guaranteed. 

We also make an assumption on the distribution of the random variables and of course we 
suppose we control the regularity of the functions gi. 



(Acorr) V; = 1, . . . , p, 2 = 1, . . . , n, X, y = gj{ti) + with polynomial functions gj of degree less or 
equal than the number of vanishing moments of the wavelet considered. For all y = 1, . . . , p, 
y),-^^ „ is a n-sample such that maxj^i . „ E 

2.3.1 Nonadaptive case 

Assume /(•) is a given criterion on the functions from [0, 1] to the positive real line. We introduce 
the function class A = {g, ]{g) < C}. Let us recall the definition of the entropy of a subspace: 

Definition 1. Let F he a subset of a metric space {C,d) of real-valued functions. The 5-covering number 
N{5, T , d) of the space T is the smallest number N such that their exist ai,...,a^ such that for each a E F 
one have d{a,aj) < S for some i G {1, . . ■ ,N}. The 5-entropy T-L{5,F,d) of the space T is defined as 
U{8,T,d) =log N{5,TJ). 

We here suppose that the ^-entropies of the subspace A for the distance associated to the norm 
II • II „ behave like 

lim sup sup II • ||„) < oo, 

for a given < v < 2. 

The penalty in equation (5) is chosen according to the two assumptions (A6) and (A7): 
(A5) For any function /z, ^Pen{h) > ]{h) with = n^'^'^+''). 
(A6) / ^ Kn{f,^) = E;Li ^(y;,Xf j8 + f{t,)) - XPen{f) is concave. 

When the special structure given in {Acorr) is assumed, we are willing to exploit it through a penalty 
on wavelet coefficients : 

(A7) The penalty Pen{h) applies only to the wavelet decomposition coefficients {dJ\)j>jg,kGZ of the 
function h. 

We are now in position to give our first asymptotic result. 



exp(^g /fly) < fly, for given constants fly > 0. 
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Theorem 2. Suppose assumptions (Al) to (A3), (A4.1), (A5) and (A6) hold and /(/o) < oo. Let fihe a 
given p-vector such that \/n\\^ — )Sq|| < c. Define fo = argmax K„{f,^). Then, 

{/,||/i|oa<Cc»},/SeIRP 

^n\\f^-fo\\n = Op(1) 

/(/^) = Op(1). 

Define )8„ = argmax Kn{fo,^). Then, 



^.||)S,-)3o|| =Op(1), 

If in addition the covariates of the linear part admit a representation of the form given in assumption [Ac 
and if the penalty satisfies (A7), then: 

V^II3„-)3oI1 = Op(i)v 



The results still hold if the number p of regression covariates goes to infinity provided the sequence 
jj(i'-2)/(i'+2)p ^pgg Q r^ii^yi gggg fg infinity and the sequence n^'"/(^+'^)p is bounded. 

The proof is given in Appendix. The main keys are controls given by Van der Geer (2000), relying 
on the entropy. 

Note that minimax optimality is obtained both for the linear predictor and the nonparametric 
estimator. The condition of correlation (Acorr) under which the optimality is attained is similar 
to the one given in Gannaz (2007a). This condition on design covariates appears to be more 
flexible than Rice's (1986) or Speckman's (1988) in the sense that the maximal degree of the 
polynomial functions intervening in the covariates depends on the number of vanishing moments 
of the wavelet, instead of depending on the regularity of the function /. Compared to 
Mammen & Van der Geer (1997), the assumptions seem much more weaker. Note that without 
correlation conditions both estimators attain nonparametric convergence rates. 

The fact that the results hold for a number of covariate p going to infinity allows to have many 
covariates in the linear part. This remark can be useful for dimension reduction modelling, where 
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the number of covariates is large. However the rate of convergence for p may be poor when the 
regularity of the function / is poor. 

In order to illustrate the general framework in which we gave the asymptotic behaviour, let 
us consider the penalty proposed in Antoniadis (1996). To exploit the sparsity of wavelet 
representations, we will assume that / belongs to a Besov space on the unit interval, S^ ,,([0, 1]), 
with s + I/ tt — 1/2 > 0. The last condition ensures in particular that an evaluation of / at a given 
point makes sense. For a detailed overview of Besov spaces we refer to Donoho & Johnstone (1998). 

Corollary 1. Suppose f belongs to a Besov hall B%-,.{R) with s > 1/2, < s — 1/2 + 1/ n, n > 
2/(1 + 2S) and 1/tz < s < min(R,N), ivhere N denotes the number of vanishing moments of the wavelet 
ip. Take the penalty: Pen{f) = YlJ^jg'^'^'^ Hk I^JIP where are the wavelet coefficients of f. Assume 
conditions of Theorem 2, (A7) and [Acorr) hold. 
If k ~ j2-2s/(i+2s)^ ^gj^ deduce from Theorem 2 that 

||/n||s,2,oo = Op(1)' 

n^/(^+'^'||/.-/o|l» = Op(1) 

= Op(1)- 

where \\f\\s,2,^ = sup2/(-i/2) \ejl\^y^\ 
i>is ^ ' ' 



Proof Birge & Massart (2000) establish the entropy of Besov balls -^^ ^^(l), with 2/(1 + 2s) < tt, 
is 1/ = 1/s. One can see that ^Pen{f) ~ ||/|ls2oo = /(/)• Consequently the ^^-entropy of 
the functional set {f,J{f) < c} can be bounded up to a constant by S^^^^. We thus can apply 
Theorem 2. □ 



One drawback of this estimation procedure is its non adaptivity: the optimal value of the 
smoothing parameter A depends on the regularity s of the function, which is unknown. Actually 
the way the penalty term is used cannot lead to adaptivity since it is closely linked to the norm of 
the functional space where the function / lies. Another penalty type may be introduced. 
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2.3.2 Adaptive case 



This section deals with the introduction of an adaptive penalty. We choose to use a -penalty 
on the wavelet coefficients. It is well known that such a penalty leads to soft-thresholding 
wavelet estimators, which are adaptive. We refer to Antoniadis & Fan (2001) for general theory 
on penalization on wavelets coefficients and to Loubes & Van der Geer (2002) for the use of the 
penalty in functional models. 

Our asymptotic results for £^ penalty is the following: 

Theorem 3. Suppose assumptions (AO) to (A3) and assumptions (A4.2) and (A6) hold. Let ^ be a given 
p-vector such that \^\\^ ~ ^oW ~ given by equation (5), with the penalty: Pen{f) = 

AYJi=ig \0f^\ where {0^) are the wavelet coefficients of f. Define fn = argmax K„{f,^). Then, 

{/,||/||oo<Coo},j8GlRP 

for A ~ Y^log(M), we have 

NS/(1+2S) _ 

loiw) ll/,i-/oll"=Op(l). 

Define /3„ = argmax Kn{f^,p>). Then, 

^n||)8,,-/3o|| =Op(1), 

If in addition the covariates of the linear part admit a representation of the form given in assumption {Acorr), 
then: 



The results still hold if the number p of regression covariates goes to infinity provided the sequence 

N (2s-l)/(l+2s) / n1/(1+2s) 

j^P^ j p goes to when n goes to infinity and the sequence ( j P is bounded. 



The proof is given in Appendix and relies on M-estimation techniques of Van der Geer (2000). 

As noted before, assumption (A4.2) is more restrictive than assumption (A4.1). The results of the 
Theorem could probably be extended to exponential tails distributions, weakening assumption 
(A4.2). We refer to discussion on page 134 of Van der Geer (2000) (Corollaries 8.3 and 8.8) for a 
discussion on the price to pay to release assumption (A4.2). 
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The adaptivity is acquired. As explained before, it gives the possibility of a computable procedure, 
without need of a cross-validation step. Yet, the parameter A is only chosen among an asymptotic 
condition and a finite sample application arises that the exact choice of this parameter is important. 
This will be discussed hereafter. Note also that the link with soft-thresholding is not evident, but 
appears in the iterative implementation of the estimators in next section. 

The optimality of the functional part estimation is of course still available in a generalized 
functional model. In such models Antoniadis et al. (2001) and Antoniadis & Sapatinas (2001) 
propose an adaptive estimation when the variance is respectively cubic or quadratic. These 
assumption have to be compared with (A3) and (A4.2). Recently, Brown et al. (2008) introduced a 
method which consists in a transformation on the observations, based on the central limit theorem, 
in order to be able to use Gaussian framework's results. Yet, even if the asymptotic results are 
satisfactory, the implementation needs an important number of observations. We will see in 
numerical study that our procedure is easier to compute. Note also that it is available with the 
presence of the linear part. 

2.4 Algorithm 

This section is only devoted to the adaptive i^-type penalty. Similar algorithms are available 
for other penalties but a cross-validation procedure should be elaborated because of the lack of 
adaptivity. It is worthy to see that the proposed estimators can be easily computed, by the way 
of iterative algorithms. A short simulation study is also given to evaluate the performance of the 
estimation with finite samples. 

The paper of Miiller (2001) concludes that a loglikelihood maximisation is preferable to a 
profile likelihood estimation (we recall this procedure consists in the maximisation of a modified 
loglikelihood criterion when estimating the parametric part). According to the author, backfitting 
computation gives in general better numerical values in estimation than others methods in GPLM. 
In a Gaussian framework, the same conclusion was observed in Gannaz (2007a) (note that time 
differences with Garmaz (2007b) are due to an improvement of the backfitting algorithm used). We 
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therefore implement a backfitting algorithm. 



Backfitting: Let jS*-*^^ be a given p-dimensional vector. For each iteration k, do: 

Stepl = argmaxKif,^^''^) 

f 

Step 2 = argmaxK{f^^+^\fi) 

The algorithm is stopped either when a maximal number of iterations k is attained or when the 
algorithm is stabilized, i.e. when \\^'^^^ — ^^'^^^^jj < || for a given tolerance value 5. The 

returned values are /S,, = /3^^^ and = /^^' with K maximal number of iterations of the algorithm. 

To compute each of the two steps we apply a classical Fisher-scoring algorithm, detailed among 
others page 40 of McCuUagh & Nelder (1989). Usual in generalized models, this algorithm 
consists in building new variables of interest by a gradient descending method, in order to 
apply a ponderate regression on these new variables. Recall the notations of the GPLM given in 
equation(4): one have ?^(X, f) = X^j8 + /(f) and}i{X,t) = fc(f/(X,f)). We will omit the dependence 
to (X, f ) for the sake of simplicity. 

Step 1: 

Note Z^'^'^) = f^^\ Repeat the following iteration for ; = ... /i - 1: 

-Letyj^^'j^ =X^^^^ +f^^'j\ 

We introduce Y^^'i^ = f^''^ + (y - u^^''^) ^ and W^*^'/) = diag ( ^ V 

With an £^-penalty, we establish that f^^'i^^^ is a nonlinear wavelet estimator for the 
observations Y^^'i\ obtained by soft-thresholding of the wavelet coefficients. The 
threshold levels are where Y denotes the forward wavelet transform 

and the inverse wavelet transform. 

Take/^'^+i) =/fr/i). 

Step 2: 

Define jS^'^'^^ = ^^''K Repeat the following iteration for ; = ... /i - 1: 
Let^W) =X/3('^'^')+/('^+i). 
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We introduce Y^'^'i^ = XR^^'^ + (y - U^^'^^ ) ^ and W^^-^) = f ^ V 

Then is the regression parameter of Y^^'f^ on X with ponderations W^^'i\ i.e. 

Take jSC^+i' = 

Maximal number of iterations /i and J2 can be fixed to 1 to simplify the algorithm (this is what is 
proposed actually in Miiller (2001)). In computation studies, no main difference is observed while 
modifying parameters /i and /i- We therefore also decide to fix them equal to 1. 

The initialization values are set as follows: for all i = \,...,n, f^^Hti) = G{yi), with G the link 
function associated to the model (with a slight modification if the value does not exist) and for 
all; = l,...,p,)Sf = 0. 

In the particular Gaussian case, the two steps are non iterative. We explicit how the estimators are 
implemented in a Gaussian framework to ameliorate the comprehension of the algorithm: 

Step 1: The iterate is the wavelet estimator for the observations y — ^^^\ with a 

thresholding on wavelet coefficients with an uniform threshold level A. 

Step 2: We obtain by a maximum likelihood estimation on y — this means that one 

have jS^'^+i) = (X^X)-iX^(y-/('^+i)). 

We can recognize the backfitting algorithm studied in Chang & Qu (2004), Fadili & BuUmore (2005) 
and Gannaz (2007b). In a Gaussian framework, the variance in observations is constant and 
consequently the threshold level is uniform. In a generalized framework, the matrix W ponderates 
the threshold level to take into account the inhomogeneity of the variance. 



In generalized functional models, i.e. without the presence of a linear part, the numerical 
implementation of the estimators proposed here has already been explored by Sardy et al. (2004). 
The authors propose an interior point algorithm based on the dual maximisation problem. 
Comparing to this resolution scheme, our procedure has the advantage of an easy interpretation 
of the different steps in the algorithm. As noted previously, wavelet estimators have also been 
explored by Antoniadis et al. (2001), Antoniadis & Sapatinas (2001) and Brown et al. (2008). These 
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papers need to aggregate the data into a given number of bins. If the two first cited papers allow to 
choose small size of bins, it appears to be quite a constraint for the third one. Actually, it is worthy 
noticing the simplicity of our algorithm, e.g. comparing with those implementations. 

2.5 Simulations 

In this subsection, we give some simulation results. All the calculations were carried out in 
MATLAB 7.6 on a Unix environment. For the DWT, we used the WaveLab toolbox developed 
by Donoho and his collaborators at the Statistics Department of Stanford University (see 
http : //www-stat . Stanford. edu/~wavelab). 

We simulate n = 2^ observations. The covariates are written according to assumption (Acorr)' 
= Si^/^) + ^i' with independent and identically distributed variables following a standard 
normal distribution, and with g{x) = 30 {x — 0.5)^ — 6{x — 0.5)^ + {x — 0.5). Three functional 
parts / were considered : we will refer at 

• the Sinus function, for a smooth function, combination of sinusoidal functions, 

• the Blocs function for a piecewise constant function, 

• and the Pics function for a function presenting high variations. 
These functions are given in Figure 1. 




0.5 1 0.5 1 0.5 1 

(a) (B; .tel 

Figure 1: Functional part of the generalized partially linear model. Figure (a) corresponds to the 
Sinus function. Figure (b) to the Blocs function and Figure (c) to the Pics function. 
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We will more precisely study the estimation quality for 

• a Gaussian distribution; observations are y,- ~ M{rji, (t^), with = tp, 

• a Binomial distribution; observations are y, such that x/i x m ^ B{}ii,m) with m = (p^^ and 
the logit link function f/, = j^^jj^, 

• a Poisson distribution; observations are ~ with the link function f/, = exp(7/;), 
as these distributions seem to be the most frequently encountered in modelisation. 

To conclude with respect to the quality of estimation, we will give the mean value estimated for the 



parameter ^ as well as the standard deviation of the estimation. For the nonparametric part, we will 

1/2 

evaluate the MISE, which is the empirical estimate for E 



All results 



were obtained on 500 simulations with the same covariates x, and the same functional parts /. A 
maximal number of k = 5000 iterations was taken and the tolerance value defined above was equal 
to ^ = lO^'^*^ when applying the algorithm. 



2.5.1 Preliminary study : choice of the threshold level 



The asymptotic behaviour of the estimators only defines the threshold level A up to a constant. 
Yet, numerical implementation needs to determine the exact threshold level A in the algorithm. In 
practice, one can see it has an important impact on the quality of estimation. Figure 2 gives the 
evolution of the MISE with respect to the threshold level in the GPLM with different distributions. 
One can observe that the threshold levels attaining the minima are very different among the 
distribution. 



With a Gaussian distribution, following Donoho et al. (1995), we choose A = 1^21^ log(n). This 
choice overevaluates the optimal threshold level in many cases but is the most often encountered 
in practice. 



17 




Figure 2: Evolution of the MISE with respect to the threshold level in a GPLM with the Sinus 
function. Figure (a) corresponds to a Gaussian distribution. Figure (b) to a binomial distribution. 
Figure (c) to a Poisson distribution and Figure (d) to an Exponential distribution. 

Binomial distribution 

Due to homogeneity reasons, it seems well-adapted to fix a threshold level of the form A = 
cy^log(n), where (p corresponds to the dispersion parameter in distribution (1) and c denotes 
a positive constant. 

To ensure this conjecture in a Binomial setup, we plot the optimal threshold obtained for different 
value of the Binomial parameter m. The study was done on a generalized functional model (GEM) 
and on a GPLM. Figure 3 confirms that the chosen form seems appropriated. 

Linear fittings are given in Table 1. The linear fitting is coherent provided the coefficients. 
The constant c obtained by linear fitting is varying with respect to the simulated samples, but the 
observed variations are small. The mean value is 0.4997 and the standard deviation is 0.011. The 
conjecture of a uniform constant seems acceptable. We therefore choose to take the value c = 0.5 
in the following for Binomial distributions. Note that due to the central limit theorem, one would 
have expected to take c = \/2 for large values of the parameter m, but our simulation study show 
that this would lead to an oversmoothing for small values of m. 



18 




(a) 




Figure 3: Evolution of the threshold minimizing the MISE when estimating the Sinus, Blocs and 
Pics functions in a Binomial setup, with respect to In a GEM on Figure (a) and in a GPLM in 
Figure(b). Calculations were done on 100 samples of size n = 2^. 



Generalized functional model 
Function Sinus Blocs Pics 

R2 coefficient 0.984 0.977 0.961 
constant c 0.483 0.488 0.512 



Generalized partially linear model 



Function 


Sinus 


Blocs 


Pics 


coefficient 


0.947 


0.981 


0.989 


constant c 


0.510 


0.507 


0.498 



Table 1 : Numerical indexes for the regression of the threshold level that minimizes the MISE with respect to 
^^(p log(«) with a Binomial distribution. Calculations were done on 100 samples of size n = 2^. 

Note all the calculations were done with a fixed sample size n = 2^. To better evaluate the form 
of the optimal threshold in practice, one could also study the evolution of the threshold level with 
respect to the sample size n. 
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Poisson distribution 



Estimation in a Poisson functional model has been more intensively explored. Note that Sardy et al. 
(2004) propose a threshold level. The main drawback is that the level given depends on the 
estimated function. Yet, the choice is based on an universal large deviation inequality which does 
not seem to be well-adapted in this procedure. Indeed, due to the iterative interpretation of the 
estimation, the inhomogeneity of the variance of the observations is taken into account within the 
estimation. 

Recently, Reynaud-Bouret & Rivoirard (2010) have developed a procedure based on wavelet hard- 
thresholding estimation for Poisson regression. In their estimation the thresholding step is defined 
directly and not through a penalization procedure like here. The authors then present a detailed 
numerical study showing the high instability of the optimal threshold level with respect to the 
estimated function. 

In our procedure, we can hope for a better stability of the threshold level, considering the fact that 
it takes into account the variance of the pseudo-wavelet coefficients at each iteration. 



Figure 4 represents the evolution of the threshold level which minimizes the MISE with respect to 



y^log(n). Table 2 gives some numerical results associated with Figure 4. 




Sinus 
Blocs 
Pics 



iooW" 



Figure 4: Evolution of the optimal threshold with respect to yTog(n) with a Poisson distribution. 
In a GEM on Figure (a) and in a GPLM in Figure(b). Calculations were done on 100 samples of size 
n = 28. 
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Generalized functional model 

Sample size n 7 8 9 10 

Mean value 1.699 1.783 2.005 2.094 

Standard deviation 0.086 0.110 0.099 0.072 



Generalized partially linear model 
Sample size n 7 8 9 10 

Mean value 1.584 1.675 1.944 2.094 

Standard deviation 0.056 0.053 0.028 0.027 

Table 2: Mean value and standard deviation of the ratio between the threshold level that minimizes the 
MISE and sJ\o^{n) with a Poisson distribution. Calculations were done on 100 samples for each function 
Sinus, Blocs and Pics. 

We observe that the optimal threshold level does not vary significantly with respect to the function 
estimated. This is an advantage on the estimation developed by Reynaud-Bouret & Rivoirard 
(2010). This also confirms that the threshold level proposed by Sardy et al. (2004) is not adapted 
here. It seems in fact that the threshold level advised by Sardy et al. (2004) is more convenient for 
a uniform procedure like in Reynaud-Bouret & Rivoirard (2010) than for the iterative estimation 
scheme implemented here. Moreover it appears that the presence of a linear part in the model does 
not change the threshold level. This stability of the estimation procedure is a interesting property. 
Recall it can be explained by the evaluation of the variance of the pseudo-variables in the iterative 
algorithm. 

Note that we do not obtain an optimal threshold level proportional to yTog(n). As the theoretical 
result is asymptotic, we should study larger values of the sample size n. According to the results 
in Table 2, one may choose a constant approximatively equal to 2 in a functional model and in a 
GPLM. We therefore choose to take the value c = 2 in the following for Poisson distribution. 
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Remark: In a Gaussian or a Binomial regression, one may need the dispersion parameter cp. 
Actually in literature, it is classically estimated at each iteration by 

Due to the bad quality of this estimator in GPLM, we prefer to consider in this paper that the 
dispersion parameter is known. In a Gaussian model, Gannaz (2007b) proposed an efficient QR- 
based estimator for (p. It would be worthy to explore whether it could be extended to generalized 
models. 



2.5.2 Example 1: Gaussian distribution 



Example 1 deals with a Gaussian model. The Gaussian distribution implementation is not a novelty 
for this estimation procedure, and we refer to Chang & Qu (2004), Fadili & BuUmore (2005) and 
Gannaz (2007b) for detailed studies on simulated or real values data. We briefly consider this case 
in order to have a comparison base for other distributions. 




_2 I , , , , J _1 I , , , , J , , , , , J 

50 100 150 200 250 50 100 1 50 200 250 50 1 00 1 50 200 250 

(a: (b) (c) 

Figure 5: An example of a simulated data set in Example 1, with the function Sinus in Figure (a), 
the function Blocs in Figure (b) and the function Pics in Figure (c). 

The signal-to-noise-ratio (SNR) of a signal is defined as the norm of the ratio of the mean value 
with respect to the standard deviation. In GPLM the SNR for the nonparametric part, noted SNRf 
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and the SNR for the linear part of the model, noted SNR^, are respectively equal to 

SNR} = It— 



and SNR2 = if O^lM 



With a high SNR, say approximatively 9, one can expect a good quality of estimation, while with a 
small value, like 2, the quality of estimation cannot be satisfying. 

Table 3 gives quality indexes for the estimation of the model on 500 simulations. As already noted 
in the different papers using this estimation procedure, the numerical results appear to be of fairly 
good quality. 



Function 


Estimation of /3 


Estimation of / 


Time 




Mean SNR 


Mean estimation 


Mean SNR 


MISE 




Sinus 


8.7 


0.9993(0.0098) 


9.1 


0.1639 


1.31(326) 


Blocs 


6.3 


0.9991(0.0168) 


9.0 


0.2197 


0.49(118) 


Pics 


2.2 


1.0002(0.0296) 


1.6 


0.3367 


0.64(153) 



Table 3: Measures of quality the estimates over the 500 simulations in Example 1 with n = 2^. For the 
parameter /S, the true value is 1 and the value given is the mean value and standard deviation appears in 
brackets. In the column Time, the mean of numbers of iterations in the algorithm is given in brackets. 



For the value of the signal-to-noise ratio (SNRf — 9) adopted in our simulations for the 
nonparametric part, the estimator is nearly able to detect the discontinuity of the Sinus function, 
as shown in Figure 6 (a). Results for the nonparametric part are very similar to those obtained 
in a nonparametric signal denoising, without a linear part. The asymptotic behaviour of the 
estimates effectively states that the presence of the linear part does not affect the estimation of the 
nonparametric part, under assumption {Acorr)- See Gannaz (2007b) for a more argued discussion 
on the influence of the linear part on estimation, explained through a parallel established with 
robust M-estimates. 
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2.5.3 Example 2: Binomial distribution 

In Example 2, we consider a Binomial distribution: observations Y, are such that x m are 
independently drawn from Binomial distributions B{}ii, m) with the parameter m equal to m = (p^^. 
The link function considered is the logistic. The mean is thus equal to 

exp(Xf/3o+/o(^,)) 

The logistic link makes sense if the canonical parameter ?/(■) belongs to the interval [—4,4], as one 
can see for example page 28 of Fahrmeir & Tutz (1994). Consequently, to get a SNR of order 9 one 
may choose a parameter m in the binomial distribution equal approximatively to 200. Note that 
Binomial distribution is often chosen to classify the data. A number of 200 classes is not adapted 
for real data applications ; actually, a classical number of classes is 2, 3 or 4, which will lead to small 
SNRs and thus to a bad quality of estimation. 

Due to this remark, we choose here to make a compromise and to apply the algorithm with a 
parameter m equal to 24, which corresponds to 25 classes. This choice is not coherent with a 
classification problem but allows to consider much reasonable SNRs. Indeed, with small values 
of m one is not in capacity of identifying the functional part in a GPLM. The observations of a 
simulated sample are represented in Figure 7. The results for this example are summarized in 
Table 4. 
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50 1 00 150 200 £50 50 1 00 150 £00 250 50 100 150 200 250 

[a) (IJ) (c) 

Figure 7: An example of the observations obtained on a simulation in Example 2, with the function 
Sinus in Figure(a), the function Blocs in Figure (b) and the function Pics in Figure (c). 



Function 


Estimation of ^ 


Estimation of / 


Time 




Mean SNR 


Mean estimation 


Mean SNR MISE 




Sinus 


2.1 


0.8559(0.0597) 


3.0 0.7656 


17.0(3922) 


Blocs 


2.0 


0.8581(0.0607) 


2.9 0.7285 


18.1(4142) 


Pics 


1.9 


0.9487(0.0426) 


1.4 0.4831 


9.8(1886) 



Table 4: Measures of quality the estimates over the 500 simulations in Example 2 with n = 1?. For the 
parameter /S, the true value is 1 and the value given is the mean value and standard deviation appears in 
brackets. In the column Time, the mean of numbers of iterations in the algorithm is given in brackets. 

The bad quality of estimation is due to the bad SNRs of the model. In order to better analyse the 
results, we simulate a GPLM with a Gaussian distribution and the same SNRs. This was done by 
modifications on the dispersion parameter or choosing proportional covariates in the linear part. 
We obtain results given in Table 5. 

Comparing Table 4 and Table 5, it appears that the estimation quality is better with a Gaussian 
framework than with a Binomial. Yet, the estimates in the Binomial GPLM seems satisfying for 
such SNRs. Figure 8 confirms that the estimate of the functional part only renders the shape of the 
function but does not allow to recover the function. Again this is in coherence with the small SNR. 
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Function 


Estimation of 


Estimation of / 


Time 




Mean SNR 


Mean estimation 


Mean SNR 


MISE 




Sinus 


2.3 


1.0008(0.0333) 


3.0 


0.4219 


1.04(253) 


Blocs 


1.9 


1.0033(0.0373) 


2.9 


0.4547 


1.87(451) 


Pics 


2.2 


1.0002(0.0296) 


1.6 


0.3367 


0.64(153) 



Table 5: Measures of quality the estimates over the 500 simulations with a Gaussian distribution with similar 
SNRs than in Example 2 with n = 7?. For the parameter )3, the true value is 1 and the value given is the mean 
value and standard deviation appears in brackets. In the column Time, the mean of numbers of iterations in 
the algorithm is given in brackets. 




0.2 0.4 6 0.8 1 2 0.4 0.6 O.B 1 0.2 0.4 6 8 1 

[a) (IJ) (c) 

Figure 8: An example of estimation of the nonparametric part in Example 2, with the function Sinus 
in Figure(a), the function Blocs in Figure (b) and the function Pics in Figure (c). 

When comparing the time costs, one can see that the algorithm is faster in a Gaussian framework. 
The reason is the fact that each step of the backfitting algorithm described above are exact calculation 
in a Gaussian case. While with a Binomial distribution each of these steps has to be solved 
iteratively. The number of iterations necessary to stabilize the algorithm is thus higher with the 
Binomial distribution. 

This can also explain the lower quality of estimation in the Binomial distribution : the mean number 
of iterations in the algorithm is higher than 3900 with the Sinus and the Blocs functions, where the 
difference with the Gaussian distribution is the most important. Recall the maximal number of 
iterations is 5000. Thus perhaps sometimes the algorithm is not stabilized when it stops, explaining 
the lower quality. 
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2.5.4 Example 3: Poisson distribution 



In Example 3, we consider a Poisson distribution: ~ with /^,- = exp(?/;). Observations of a 

simulated data sample are represented in Figure 9. The results for this example are summarized in 
Table 6. 



300 
250 
200 
150 
100 
50 




(aj 



(c) 



Figure 9: An example of the observations obtained on a simulation in Example 3, with the function 
Sinus in Figure(a), the function Blocs in Figure (b) and the function Pics in Figure (c). 



Function 


Estimation of )3 


Estimation of / 


Time 




Mean SNR 


Mean estimation 


Mean SNR 


MISE 




Sinus 


9.1 


1.0971(0.0763) 


7.9 


0.5548 


11.4(2701) 


Blocs 


3.1 


1.2020(0.1201) 


8.2 


0.4156 


18.8(4479) 


Pics 


3.0 


1.1876(0.1751) 


7.1 


0.4809 


19.6(4528) 



Table 6: Measures of quality the estimates over the 500 simulations in Example 3 with n = 2^. For the 
parameter /3, the true value is 1 and the value given is the mean value and standard deviation appears in 
brackets. In the column Time, the mean of numbers of iterations in the algorithm is given in brackets. 



Similarly to what was done in Example 2, we also simulate data sets with similar SNRs with a 
Gaussian distribution, in order to be able to compare the qualities of the estimates to a given 
reference. Results obtained in a Gaussian case are given in Table 7. 
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Function 


Estimation of 


Estimation of / 


Time 




Mean SNR 


Mean estimation 


Mean SNR 


MISE 




Sinus 


9.3 


0.9990(0.0101) 


7.9 


0.1875 


0.61(147) 


Blocs 


3.2 


0.9943(0.0311) 


8.2 


0.2261 


1.39(336) 


Pics 


3.0 


0.9957(0.0332) 


7.1 


0.1416 


0.74(178) 



Table 7: Measures of quality the estimates over the 500 simulations with a Gaussian distribution with similar 
SNRs than in Example 3 with n = 7?. For the parameter )3, the true value is 1 and the value given is the mean 
value and standard deviation appears in brackets. In the column Time, the mean of numbers of iterations in 
the algorithm is given in brackets. 

The quality of estimation seems good, even if lower than in the Gaussian distribution framework. 
Like for the Binomial distribution, each step of the backfitting algorithm is solved by Fisher-scoring. 
Consequently, the time of calculation is higher than in the Gaussian model. Like with the Binomial 
distribution, the lowest qualities are observed when the mean of iteration numbers is high. Clearly, 
the fact that the mean of iteration numbers is more than 4400, with still a maximum at 5000, means 
that the algorithm often would have need more iterations to get stabilized. Thus the quality should 
probably be better with more iterations. 




0.5 1 0.5 1 0.5 1 

m ffil: (c) 

Figure 10: An example of estimation of the nonparametric part in Example 3, with the function 
Sinus in Figure (a), the function Blocs in Figure (b) and the fimction Pics in Figure (c). 

Figure 10 gives an example of the estimation of the functional part obtained. One can see that 
visually the quality is satisfying. Note that one cannot compare with the Binomial case because of 
the difference in SNRs. 
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Conclusion 



This paper proposes a penalized loglikelihood estimation in generalized partially linear models. 
With appropriate penalties, we establish the asymptotic near-minimaxity of the estimators for both 
the linear and the nonparametric parts of the model. The result holds for a large class of functions, 
possibly non smooth and the conditions of correlation between the covariate design of the linear 
part and the functional part are similar to literature's. Moreover with an £^ penalty on the wavelet 
coefficients of the function (leading to soft thresholding), the procedure appears to be adaptive 
relatively to the smoothness of the function. In this particular case we develop an implementation 
and observed satisfactory results on simulation studies. 

Our ongoing research may deal with the estimation of the dispersion parameter in generalized 
partially linear models. Developments in non-equidistant designs for the nonparametric part 
should also be explored. 
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3 Appendix 

The proof is structured as follows: in Section A, we justify the fact that under a structure [Aeon-) 
of the covariates, with a well-chosen penalty, the scale part of the linear part does not intervene in 
the quality of the estimators. Section B studies the behaviour of the nonparametric part /, while 
Section C presents the asymptotic properties of the estimator for the linear regressor /3. 

A Decorrelation of the scale components 

We suppose here that the assumption (Aeon-) holds and that in the penalty only intervene the 
wavelets coefficients. To be more precise, given a vector of real values e = (ei, . . . , e„ ) ^, the discrete 
wavelet transform of e is given by a n x 1 vector comprising both discrete scaling coefficients, 
noted (0?^ jJkGZf and discrete wavelet coefficients, {(^Jl)j>js,kGZ- The wavelet inverse transform of 
the scaling coefficients {6j^ ]Jkei (fixing the wavelet coefficients equal to zeros) is noted e^ and the 
wavelet inverse transform of the wavelets coefficients (f^jjj.)jtez (fixing the scale coefficients equal 
to zeros) is noted e^ . The penalty is in this section assumed to be such that Pen{f) is a function of 
only. To better apprehend the mechanisms, let us decompose each components X, and / in the 
criterion into their scale parts and and their wavelets parts X-^ and 

Recall the estimation procedure consists in maximizing the criterion given in equation (5), first 
with respect to the functional part / and afterwards with respect to the linear component. For any 
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function / and any p-dimensional vector j8, the criterion Kn{f, j8) given in (5) can be written 

Kn{f,^) = f^i(y„Xj'^^ + a{f,xS^){t,)) - Pen{a{f,X)) =: U{a{f ^),X'^ (6) 



n 

L 

! = 1 

rST 



with a{f, jS) (t,) = Xf^ + /(f,) . It is worthy noticing the criterion L„ {h, X^/3) does not depend 
of X^ for a given function h. For a fixed vector )3, = argmax K„ (/, /3) = — X^^)3 with = 

flrg?nflx L„(/2,)3). Replacing /by its estimation, _K„(/yg,)3) = L/Li ^ (1///^^"'^]^ + ^jS^^')) ~ ^^"(^jS)' 
Consequently, the estimate = argmax K„ {fn, )8) is in fact defined independently from X^ in such 

a framework. 



It is clear that the properties established for the argument maximizing the criterion X„ are available 
when maximizing the criterion L„. When the structure of the covariates and of the penalty are 
well-adapted we will then consider without loss of generality that the covariates (X,),=i^ „ contain 
only the wavelets components (X-^),=i^ „. 



B Estimation of the nonparametric part by 

In this Section, we have to distinguish between the non-adaptive type penalties of Section 2.3.1 and 
the penalty on wavelet coefficients. The whole scheme is the same, so the second case will be 
less detailed. 



B.l Nonadaptive case 

Here, the penalty is supposed directly linked with an entropy control through assumptions (A5) 
and (A6). 
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B.1.1 A Taylor expansion 



The difference between the criterions Kn{f, j8) and K„{fo, j8) is equal to: 



-(X„(/,)S)-X„(/o,)S)) 



Note TjQ^i = /3q + faiti). A Taylor expansion of degree 2 at the points {(.{yi, ^o,i))i gives: 

l;{Kn{f,^)-Knifo,^)) 



1 ^ 



1 



LKVi^Vo,,) ifiti) -m)) - 1/2- {f{U)-fo{ti))) 



! = 1 



A2 



Co 



With Ti(/, j8) < I sup, sup^ 1 E.- (|xr()8 - + /(f,) - /o(fO f + 1/(^0 - foiti) p) ■ 



Let us study the term Ti (/, /3) . Under assumption {A3), 



Using the convexity of x i— >• , it comes: 



Ti(/,)8) < csie 



1 



! = 1 



3 1 



X: X[03-)So) +;El/(fO-/o(f,)l 



/=1 



Til 



Tl2 



Then, one can easily see that when assumptions (Al) and (A2) hold, Tn = o{p/n) and that 
^12 = ^ E \f{U) ~ fo{ti)f- Consequently, if pv^Jn is bounded then for every function / satisfying 

f Efl/(^,)-/o(fOl'<cwehave 



vlTi < C. 



When v„ = n^/t^+f) -^Yith v > 0, the sequence pv^/n is bounded if pn ^/(^+^) is bounded. 
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Let 

2 T 2 

Snif) = ^ {K„{f,^) - X„(/o,)3) -A, + A2-Co) + ^ {Pen{f) - Pen{fo)) . 

3 

For every sequence u„ —> and for all function / satisfying ^ JLi Ifi^i) ~ foi^i) I ^ we prove 
UnSnif) 0. Fix n„ a decreasing sequence going to 0. Convexity argumentation (Theorem 10.8 
in Rockafellar (1970) with a change of variable on / to obtain a set independent from n) gives: 
«^P|e,|/(,)_/„(,)|3<.""S,(/) ^ 0. Consequently: 

sup UnSnif) ^ 0, whereN(/) = vl\\f - fo\\l + J{f), (7) 

N(/)<c 

noticing that L; |/(^/) — /o(^/)|'^ < — /o||^ ||/ — /o||oo, and that we consider the maximisation 
problem on the set {/, ||/||oo < Coo}- 

Remark: We need / i— > Sn{f /Vn) to be concave for the norm ||. — /o||ii for every n. It is worthy 
noticing the penalty Pen does not intervene in the definition of S„ . 



B.1.2 Behaviour of the different terms 



Similarly to Bai et al. (1992), we suppose N(/) > c„ with c„u„ — > oo. Then one can build a sequence 
c'„ satisfying c'„u„ — > oo et c'„ < c„ and such that supj^^^^<^, w^tSfc goes to when k goes to infinity. 

For N{f) = c'„, this means: 

2 1 2 

^ (K„(/,)3) - K„(/o,/3) - Al + A2 - Co) + ^ (Pen(/) - Pen(/o)) = 0(1)- (8) 

• Control of the term Ai = 1 ZU KVr^ W) {fiU) - /o(f/))- 

Following Mammen & Van der Geer (1997), let ^ be a set of uniformly bounded functions on 
[0, 1] satisfying 

lim sup sup 5^'7i{5,A, \\-\\n) < ^■ 

Suppose ]-+/(/) ^ Then, similarly to Mammen & Van der Geer (1997), when assumption 
(A4.1) holds: 

sup ^TT—-. TT^ = Op(1)- 
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Under assumption (A6), if N{f) = c'„, then /(/) < c'„, and consequently, we obtain that with 

v„ = n^/i^+y): 

sup v^Ai = Op(c',). 

/,ll/IU<Coo,N(/)<c;, 

Control of the term A2 = HLi K^/o,/) (/(^i) - h{U))f ■ 

Thanks to assumption (A3) we immediately get: v^A2 > bo {'fn\\f — fo\\n)'^ and thus 
infN(/)=c;, M > boc'n^- 

Control of the term Q = i LU KVo,i) {{^ - ^oV^i) ifi^i) " MU)) • 
Cauchy-Schwarz inequality applied to Co gives: 

Co < supK,/o.) \\m)-fom„ (^Eii>^'i!')'^'iii3-iSoii- 

When \/n||j8 — < c and assumptions (Al) to (A3) hold, we get: 

Co<C(i3)n-i/2||/-/o||,, 
with C()3) independent from /. Thus: 

sup vlCo = Op.s. (4) , 

N(f)<c'„ 

for u„ = nT^. 



B.1.3 Conclusion 



Using together the convergence (7), the bounds of Ai, A2 and Co and assumption (A5), we obtain: 

sup u„ ^ {Kn (/, )3) - X„ (/o, )3) ) < 

{f,N{f)=c'„} " 

with a probability going to 1 when n goes to infinity. Concavity of assumption (A6) (because it 
implies the decrease of the slopes) allows to extend this result to 

sup Un^ {K„{f,^) - K„{fo,^)) < 0. 

{f,N{f)>c'„} " 

The estimator is the argument realizing the maximum of K„ (-, )S) and so P (N(/) > c'„) ^ 0. 
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B.2 Adaptive case, with an £^ penalty 

In this section Pen{f) = Yl'i=i, l^f^l with 6^ vector of the wavelet coefficients of /. In the first time, 
we are willing to study how this penalty can lead to a functional space for which we can control 
the entropy as above. The underlying idea is to distinguish the behaviour of the penalty among the 

/ xl/(l+2s) 

resolution degree of the wavelets coefficients. We introduce iw = { \og{n) ) • resolution 

level is higher we will establish a link with an other penalty, but if the resolution level is smaller, 
then we are going to see that the norm \\-\\n offers a sufficient control. 



B.2.1 Highest resolution levels 



Let Pen, Jf) he Pem^if) = ELijOf] 



Minoration 

We aim to bound above the truncated penalty Penj^{f) using Ji„{f) = "^'"^^EL/w l^/ll^ with 
|0 = 2/(1 + 2s). To this objective, we decompose as follows : 

i>iwAOr\<^ i>iw,\df'\>e 

• Holder inequalities gives: 

E \or\p<mor\<^'-' ( e ■ 

Observing that < A} = Ll|e,.|/A<i > E|ew|<. 1^1^ we have 



The bound E|0W|>, \9f^\ > e'^'P L\ey^\>, \9f^\P is evident. 
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It follows that: Perii^if) > E^-PnP'^]i^{f). Taking e = A, Vn = (Mniy^ ^ ''^^^'^ and A > ^log(n), 
we obtain that ^Pen,^(/) > /,^(/). 



Maj oration 

We are now willing to bound Perii^ ifo)- Using Holder inequality, 

l/TT 



because as xp and / admits compact supports the number of non-zero coefficients at a resolution 
level is equivalent to 21 Thus, PeUiJ/o) < Zj>j^ 2-i'~'-^/^h^/^fo\\s,u,co, and 



"-PenrMo)<^iv!'''''^\\fo\U^- 



/. . NN s/(l+2s) / n1/(1+2s) 

For v„ = ( -iH j and A ~ ^/log(n), it is sufficient that iw > ( ) to get 

^PentUfo) < \\fo\\s,n,oo. 



B.2.2 Lowest resolution levels 



We are interested in the study of Peni^{f) = J^'-Zi^ \ef\. Note that {Peni^if) - Peni^{fo)\ < 
L'iZi^ ll^n - Ki\\^'i=is - Cl - Using Cauchy-Schwarz inequality, 

\Pen,,{f) - Pen,^{fo)\ < iU' (E 1^^^ " Cl')''' = {niw)'^'\\f - foWn- 

/J . NNs/(l+2s) / n1/(1+2s) 

For Vn = ( „ j and A = yTog(n), it is sufficient that < to deduce 

^\Pen,^{f)-Pen,^{fo)\<v4f-fo\U. 



To conclude the penalty study, we have established that, with the adapted choice of iw^ 

{Pen{f)-Pen{fo))-R> hjf) with \R\ < ||/o||,^,,^ + y„||/ - /o||,. 
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B.2.3 Bias term Ai = 1 Ef^^ l{yi, tjoj) {f{t{) - fo{ti)). 



LetN(-) and /(•) be defined as follows : N(/) = vl\\f - fo\\l + Ji„{f) and /(/) = n-P/^^^^^.^ 

We are going to put in evidence the link between these two quantities. By Cauchy-Schwarz 

inequality. 



t=is 



2s/(l+2s) 



Then for y„ = (^)^^' " ' as < (n/ log(n))i/(i+2'^), we have: L]Z,jOf^ - e^-\P < 
^nWf - fo\t When vlWf - /of = c'„ ^ oo, it implies that /(/ - /o) < N{f) + hj/o). 



We will consider the functional set A = {/, /(/) < Q}. Lemme 4.3. of Loubes & Van der Geer 
(2002) states the entropy of A satisfy n{A,S, ||.|j„) < AS ^-p (log(n) + log(l/^)) where A is a 
constant. With 1/ = ^ = 1/s, this can be written jf^^^n{A,u,\\.\\ny^^du < A^/log{n)R^-"/^. 



Corollary 8.3 of Van der Geer (2000) implies that under assumption (A4.2), for R > 3 / a > 1/n, and 
C > 1, we have: 

Pf sup -^^^f^i{y^,r]o,i)f{ti) >2ACoS'-'/^c] < CoexY>{-Ald-'C). 

\Mg,g€A,\\g\\„<R} Vl0S(")V" i=l J 

Following the proof of Lemma 8.4 in Van der Geer (2000) we can deduce the existence of a constant 
c depending only of A, v, R and of constants in (A4.2) such that: 



forall T > c, P sup SU^(y..^o 0/(^0 > ^ < cexp(-TVc^). 



Assume g = N(/f+/° (/o) • According to previous developments, g belongs to ^ and \\g\\n < 1- Thus, 



With Vn 



log(") 



1/(2+1') 



sup vlA, = Op(n-^/Vl°S(")^"^^'''^"'''"''''^"'') 

f^N{f)^c'„ 

^ ,j l/2+v/4x 

= Op(c„ ) 



As < 1/ < 2, immediately, sup^ N{f)=c' ^n^i = ^fi^n) when c', — > co 
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B.2.4 End of the proof 

The scheme is very similar to what has been presented in the nonadaptive case and will not be 
detailed here. 



C Estimation of the linear part: study of the behaviour of 

With a Taylor expansion of order 2, 

i = l ! = 1 



i=l 1=1 



2 



Bi B2 

J2Hvo,) (xf03-)So)) {f^{td-Mti)) +T2O3) 



;=1 



Co 



C.l Behaviour of the different terms 



C.1.1 Control of Co. 



We should first study the convergence of the rest term T2 but the mechanisms which intervene 
in the majoration appears more clearly in Co. Write Co = ll/^g — /ollooEyL^ Coj{^- — /Sq^ with 

(//2(f,)-/o(t/)) 

Coj = EUHvo,)x^,j P- ■ 

Without assumption (Acorr) 
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Applying Cauchy-Schwarz leads to 

l|CO;l|<||/^-/0|Ufo. 

Under assumptions (Al), (A3) and what precedes, it comes y„|| Co || = Op(1)- 

With {Acorr) 

We assume the penalty does not deals with scale components of the nonparametric part of the 
model. As explained in Section A, we consider actually that for all / = 1, . . . , n, the covariate X, has 
a null scale representation. Assumption {Acorr) implies moreover that the polynomial functions 
gj have a null wavelet representation and so Co = Ya^i 'b{Vo,i) (^^J^'^i^ ~ ^o)^ {f^^^i) ~ hi^i))' 
where represents the wavelets part of 

The wavelet transform is orthonormal and thus has the same properties of ^j. When these 
variables satisfy an subgaussian assumption such as (A4.1) or (A4.2), we can apply a control of the 
term using entropy similar to what has been done before, lying on Van der Geer (2000). 

When the entropy satisfies 7^(^,(5, < A5^^ , following Mammen & Van der Geer (1997) we 
get: 

CO; 

""P ,.v rrvn = Op(i). 

Using \\J^ - fo\\n = Oriv^^) for ^\\^- < c, it comes: 

Co = Op(1) hn' V n-i/(2+")y-^/^ _ ^^j. 

/=i 

Using Cauchy-Schwarz inequality: 



Co = Op(1) ( V n-i/(2+v)y-^'/^ V^p'^H^ - 



Providedpn= (^z;,^^ M^^/^^+'^^y '''^ p^/^ ^ 0, we have Co = op(l) for /3 such that 11/3 - /3o|| < 
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When the subgaussian assumption (A4.2) is weakened in the exponential tails assumption (A4.1) 
we can probably obtain a similar result using Lemma 8.4 of Van der Geer (2000). 



With the £^-penalty the majoration for the (^-entropy becomes ^{{AfS, \\.\\n) < ^^''(log( 
log(l/^)). Using again Van der Geer (2000) we have 

(2s-l)/(l+2s) 



and so 



Co = op(i: 



sup Coj = Vn( , " 



(2s-l)/(l+2s) 



It is sufficient to suppose |0„ = yi^^ 



log(n) 

(s-l/2)/(l+2s) 



p ^ Oto get Co = op(l) for all v/"||)3 - jSgll < c. 



C.1.2 Control of T2 



The rest term in the Taylor expansion T2 is bounded by: 



T2<cstel^ [xJ{^-^,)+f^{t,)-fo{ti)) -(^/^(f,)-/o(i, 
We decompose in three terms: 



T21 < sup ||Xf II ll/S - )3J ^ |X[()S - )3o)||/^(f,) - /o(^,)|■ 
Proceeding as for the term Co, when ~ /^oII — obtain: 

T21 <sup||X[||||)S-)3o|| Op (pn). 

i 

Assumption (A2) implies T21 = Op (p„ ) . 
• T22 = L|xf(/3-/3o)|/^(i0 -fo{ti)\^. Note that we have 

Ti2<%-h\uY.\^j{^-mJ^{ir)-hm- 

And when ^/n\\^ ~ ^oW ^ the asymptotic behaviour is T22 = Op(ll/jg ~ /ollooPn). 
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• 723 = E 



Xf (/S - )3o) . Under assumption (Al) and (A2), T23 < o(l) - ^o\\n) ■ 



We have proved that for all )3 satisfying \/n\\^ ~ ^qWu < C2, the term T2 goes to when n goes to 
infinity, provided that p„ — > 0. 



C.1.3 Control of B2 = 1/2^^1 HVo,i) MiP - A 



Recall that assumption (A3) stands that ELi 'Hvo,i)^J^i goes to a strictly positive matrix Eq when 
n goes to infinity. Thus, up to a constant c, we have B2 > '^7(^o)"||)S — /3g||^, with 7(^0) smallest 
eigenvalue of So . Consequently inf||yg yg ||^^, B2 > c7(Zo)nc'„^. 



C.1.4 Control of Bi = KVu Vo,i) (^Ji^ " A 



^0- 



Write Bi = B()3 — Ao) with B = ^(yi/ f/aO^f an n x p-dimensional matrix. The norm of B 
satisfy the inequality: 



When assumption {Acorry) is satisfied, the law of large numbers implies that 

||B|l'<<^'Ec^/op.s.(l). 

7=1 



Consequently 



sup Bi = Op(4)- 

/^iiA-Aoih':;, 
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C.2 Conclusion 



We have proved that 

goes to for all /3 such that \/n\\^ ~ ^q\\ ^ Using convexity (see Rockafellar (1970) as before) the 
convergence is available for the supremum supy^y^^ ^n{^) — > 0. 

Suppose now that ^/n\\^ ~ ^oW — with c„ — > oo. Then we can build a sequence c', statisfying 
c'„ 00, c', < Cn and sup^||^_^^|| R„{^) 0. Note that 



Rni^) = [K„{f^.^)-Kn{fo,^o)) - [Uf^^f^o)-Kn{fo.^o))-^l + ^2. 

We have Kn{f^, jSg) — Kn{fo, )3q) < 0. Using moreover the studies of terms Bi and B2 show that 



sup (x„(/.,)8) - K„{fo,^o)) < OrK) - K 

Vrr\\^-^J„=c'„ 



/2 



with k strictly positive constant, and thus 



sup (Kn{fB^^)-Ufo,^)) < 

v^llj8-/3„l|„=c;, 

with a probability going to 1 when n goes to infinity. 
Convexity gives: 

sup X„(/^,)3)-X„(/o,)8o) <0 

with a probability going to 1. As the estimator )3„ minimizes K„{f^, )8), we conclude 

P(v^ll3„-/Soll >c„)^0. 
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